2025/05/24
| Model | Large-Scale RL? | AIME 2024 | MATH 500 | GPQA Diamond | Code Forces | |
|---|---|---|---|---|---|---|
| GPT-4o (OpenAI 2024) | ❌ | 44.6 | 60.3 | 50.6 | >11.0% | |
| o1 (OpenAI 2024) | ✅ | 74.4 | 94.8 | 77.3 | >89.0% | |
| R1 (DeepSeek-AI 2025) | ✅ | 79.8 | 97.3 | 71.5 | >96.3% |
OpenAI (2025):
Deep research independently discovers, reasons about, and consolidates insights from across the web.
To accomplish this, it was trained on real-world tasks requiring browser and Python tool use,
using the same reinforcement learning methods behind OpenAI o1, our first reasoning model.
Check OpenAI Deep Research’s demo video for more details.
We can model Reinforcement Learning (RL) as a complex dataflow graph, consisting of:
RL with Large Language Models (LLMs) is even more challenging.
| Paradigm | Pro | Con |
|---|---|---|
| Single-Controller | Flexible | Commnucation Overhead |
| Multi-Controller | Efficient | Complex Programming |
Which paradigm should we choose?
We can have both!
Hybrid-Controller = Single-Controller + N x Multi-Controller
Single-controller enables verl to implement various RL algorithms by only modifying a few lines, usually only in the fit function.
See the example changing PPO to GRPO:
for prompts in dataloader:
# Stage 1: Sampling Trajectories
batch = actor.generate_sequences(prompts)
# Stage 2: Preparing Experiences
batch = reward.compute_reward(batch)
batch = reference.compute_log_prob(batch)
batch = critic.compute_values(batch)
batch = compute_advantage(batch, "gae")
# Stage 3: Training
critic.update_critic(batch)
actor.update_actor(batch)for prompts in dataloader:
# Stage 1: Sampling Trajectories
batch = actor.generate_sequences(prompts)
# Stage 2: Preparing Experiences
batch = reward.compute_reward(batch)
batch = reference.compute_log_prob(batch)
# DELETED: batch = critic.compute_values(batch)
batch = compute_advantage(batch, "grpo")
# Stage 3: Training
critic.update_critic(batch)
actor.update_actor(batch)In this section, we discuss how verl optimizes the multi-controllers, i.e., compositions of traditional LLM training workloads.
The optimal parallelism strategies for different workloads are usually different, e.g.,
| Implementation | Pro | Con |
|---|---|---|
| Fixed GPUs for each workload | Simple | Many bubbles |
| Hybrid Engine | Efficient | Need resharding |
Parallelism Algorithms:
Training Backend:
Generation Backend:
For the most timely updates of important features, please keep an eye on verl’s README.
verl is working on supporting efficient RL training for huge MoE like DeepSeek-V3-671B, based on the following features:
GPTModel class for actor and criticFor more details, please check our tracker #708.
Welcome to join the verl community to discuss and contribute
To enable this, use use_remove_padding.
To enable this, use balance_batch.
However, in gradient accumulation,
To resolve this, verl supports to
To enable this, use use_dynamic_bsz.
Multi-turn generation might have different implementations:
Specifically, verl integrates:
Engine.async_generate (contributed by the SGLang RL team)AsyncLLM (contributed by Xibin Wu from ByteDance)enable_gradient_checkpointing)use_torch_compile)use_liger)A canonical RL dataset in verl has the following fields:
prompt: a list of messages {"role": "...", "content": "..."}data_source: used to choose the reward functionreward_model: a dict containing
"ground_truth""style" like "model" or "rule"extra_info: a dict containing extra informationFor VLM RL, verl expects fields "images" and/or "videos"
For examples, please check the examples/data_preprocess.
You could also customize the field names via config. Please check the data section in config files like ppo_trainer.yaml for more details.
For further customization, verl provides the data.custom_cls config,
The custom dataset class defined in the .py file is required to accept the following initialization parameters:
verl allows to define custom reward function via the custom_reward_function config:
The function defined in .py should accept the parameters passed from the reward manager __call__ method. Taking NaiveRewardManager as an example:
For more complex features, you can also add a new reward manager like PRIMERewardManager or DAPORewardManager.
To modify the loss function, the most convenient way is to
.backward() callcompute_policy_lossentropy_lossFor example, the DataParallelPPOActor.update_policy method defines the loss function as follows:
class DataParallelPPOActor(BasePPOActor):
def update_policy(self, data: DataProto):
pg_loss = compute_policy_loss(
old_log_prob=old_log_prob, log_prob=log_prob,
advantages=advantages, # ...
)
entropy_loss = agg_loss(loss_mat=entropy)
policy_loss = pg_loss - entropy_loss * entropy_coeff
kld = kl_penalty(
logprob=log_prob, ref_logprob=ref_log_prob, # ...
)
kl_loss = agg_loss(loss_mat=kld)
policy_loss = policy_loss + kl_loss * self.config.kl_loss_coef
loss.backward()As mentioned above, the main training logic is concentrated in the fit function of the trainer classes like RayPPOTrainer.
For example, the DAPORayTrainer class overrides the fit function to implement the “dynamic sampling” feature:
(See the next slide for the code ➡️)
class RayDAPOTrainer(RayPPOTrainer):
def fit(self):
for epoch in range(self.config.trainer.total_epochs):
batch = None
for batch_dict in self.train_dataloader:
new_batch = DataProto.from_single_dict(batch_dict)
num_gen_batches += 1
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
new_batch = new_batch.union(gen_batch_output)
if not self.config.algorithm.filter_groups.enable:
batch = new_batch
else:
# Getting `kept_traj_idxs` ...
new_batch = new_batch[kept_traj_idxs]
batch = new_batch if batch is None else DataProto.concat([batch, new_batch])
prompt_bsz = self.config.data.train_batch_size
if num_prompt_in_batch < prompt_bsz:
max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
continue
else:
traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
batch = batch[:traj_bsz]
# ...verl: Flexible and Efficient RL for LLMs